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We present the first GPU-based conjugate gradient (CG) solver for lattice QCD with domain-wall 
fermions (DWF). It is well-known that CG is the most time-consuming part in the Hybrid Monte 
Carlo simulation of unquenched lattice QCD, which becomes even more computational demand- 
ing for lattice QCD with exact chiral symmetry. We have designed a CG solver for the general 
5-dimensional DWF operator on NVIDIA® CUDA^'^ architecture with mixed-precision, using 
the defect correction as well as the reliable updates algorithms. We optimize our computation 
by even-odd preconditioning in the 4D space-time lattice, plus several innovative techniques for 
CUDA kernels. For NVIDIA GeForce® GTX 285/480, our CG solver attains 180/233 Gflops 
(sustained). 
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1. Introduction 

Simulation of unquenched lattice QCD with exact chiral symmetry is a grand challenge among 
all sciences. Even for a modest 16^ x 32 lattice with lattice spacing a ~ 0.1 fm, it often requires 
a supercomputer with peak computing power more than 50 Teraflops (e.g., 10 racks of IBM Blue- 
Gene/L). Therefore, only 2-3 lattice QCD groups around the world could afford to perform the 
simulation of unquenched lattice QCD with the domain-wall fermion [^, or the overlap-Dirac 
fermion However, this scenario has been undergoing a dramatic change during the last 12 
months. With the emergence of low-cost and computationally powerful Graphic Processing Unit 
(GPU), now plugging a graphic card with NVIDIA GTX 285 (240 cores, one Teraflops peak) into 
a PC immediately turns the system into a powerful device, attaining a sustained 180 Gflops for the 
conjugate gradient (CG) with mixed precision. 

Since 2009, the Taiwan Lattice QCD Collaboration (TWQCD) has been using a GPU clus- 
ter (currently constituting of 250 NVIDIA GPUs with 40 Teraflops (sustained)) to simulate un- 
quenched lattice QCD with optimal domain- wall quarks ^ We have met the challenge of pre- 
serving the chiral symmetry to a high precision and sampling all topological sectors ergodically. 
For our recent physical results on the 2-flavors QCD, we refer the readers to [Q] and [^]. 

In this paper, we present our design of the first CG solver for the general 5-dimensional DWF 
operator on NVIDIA CUDA architecture with mixed-precision, using the defect correction as well 
as the reliable updates algorithms. Our CG solver is optimized with even-odd preconditioning 
on the 4-dimensional space-time lattice, plus several innovative tuning techniques for the CUDA 
kernels. For NVIDIA GeForce GTX 285/480, our CG solver achieves 180/233 Gflops (sustained). 

2. Conjugate Gradient Solver for Domain- Wall Fermions 

2.1 Optimal Domain- Wall Fermions 

Mathematically, for a given A^^ (the number of sites in the fifth dimension), the maximal chiral 
symmetry can be attained by the optimal domain-wall fermion (ODWF) |^] with the operator 

[&{my)]xx';s,' = {(OsD„ + l)xv'4.v' + (a,Av - l).rf^.,v', (a, = W,) (2.1) 

Here D„ is the standard Wilson Dirac Operator plus a negative parameter —mo (0 < wiq < 2), 

{D„),^ = -\Y.[{\-y^)U^{x)5,+f,^ + {\ + y^)Ul{x')5,^f,^^]+{d-mo), (2.2) 

where U^{x) denotes the link varaible, d is the dimension of the space-time (<i = 4 for QCD), 

L = P+L+ + P L , /'± = (l±75)/2, (2.3) 

and 

(T \ j Ss-l/, 1<S<N, J 

(L+j,v = < / /o NO , = (2-4) 

-[mq/2mo)ON,j', s = 1 

The weights {ft)s} along the fifth dimension are fixed according to the formula derived in [^] such 
that the maximal chiral symmetry is attained. In general, for other DWF with non-maximal chiral 
symmetry, the weights {Os} and {ft),v} have different values, e.g., for the conventional (Shamir) 
DWF, Os = 0,(0, = 1,V5, and for the Borici DWF [^], a, = (O, = 1, V^. 
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2.2 Even-Odd Preconditioning 



Since D^, commutes with {(o)ss' = Wi^vv' ^^id (a)sv = Osd^s', Eq. becomes 

^(m^) =D„,(w + aL) + (l-L). 



(2.5) 



Separating the even and the odd sites on the 4D space-time lattice, Eq. ( |2.5D can be written as 



dS^f X 



where 



X= (<i-mo)a)(l+cL) + (l-L), Y = co{l+cL), (c),,' (a,/a),05„ 
Now we further rewrite it in a more symmetric form by defining 



-1 , 



-1 /7T-1 



and 



Si=^/(o Wx-^ =M5V(0 \ S2 = Y-^,/^. 



then Eq. (2.6) becomes 



-1 / 1 ^5£>S' 



1 OWlOWlMsD^- 



EO^ 



M5D 



OE 



1 



1 / \ C / \ 1 



C=1-M5D2^M5D^^. 



(2.6) 
(2.7) 

(2.8) 
(2.9) 

(2.10) 
(2.11) 



Obviously, the most time-consuming task in the HMC is to solve the linear system CC^\x) = \b) 
by the conjugate gradient (CG), namely, in the computation of the fermion force in the molecular 
dynamics. In this work, we implement the entire conjugate gradient inside the NVIDIA GPU, 
which can be used for the HMC as well as for computing the valence quark propagators. 

2.3 Algorithm 

Conjugate Gradient (CG) method [||] is a widely-used numerical algorithm for iteratively solv- 
ing a linear system Ax = bto a. certain precision e, where A is a positive-definite Hermitian matrix. 
With the CG algorithm (see Algorithm |l]), the problem is turned into a task dominated by the 
matrix-vector multiplication. In this work, we utilize CUDA to implement the 5D domain-wall 
fermion operator ( 2. 10| ) matrix- vector multiplications of the CG on NVIDIA GPUs. For the GPU, 
the single-precision operations are several times faster than the double-precision ones, thus it is ad- 
vantageous to use the mixed-precision CG. In the so-called defect correction algorithm, one solves 
X in the low-precision, and updates the solution x and the residue r in the high-precision, (see 
Algorithm ^, where the hatted symbols represent variables in the high-precision). In this fashion, 
most of the floating-point operations are in the low-precision, thus it is advantageous for the GPU. 
Theoretically, the defect correction algorithm is mathematically sound, and it always works in prac- 
tice. However, the seemingly drawback is that one has to build up the Krylov space every time it 
restarts the CG in the low precision. On the other hand, if one does not reset the low-precision p 
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Algorithm 1 Conjugate Gradient 
xo := initial guess 
ro ■=b-Axo 

Po ■= ro 
k:=0 

while \rk\ > e\b\ do 

OCk ■■= in,rk)/ {pk,APk) 
rk+i ■= rk - akApk 
h+\ '■= {rk+\,rk+\) / {rk.rk) 
xk+\:=xk + akPk 

Pk+i ■= rk+i+Pk+iPk 
k:=k+l 
end while 



Algorithm 2 Mixed-Precision Conjugate Gradient (Defect Correction) 
X := initial guess 
r := b—Ax 
while \rk\ > £\b\ do 

r:=f 

p := r 

x:=0 

Use Algorithm |l| to solve x = V in the low -precision to a precision e 
X :=x + x 
r := b —Ax 
end while 



vector inside the while loop of Algorithm ^ (i.e., skipping the step (p := r) except at the first time), 
the "warm-up" time in re-building the Krylov space could be reduced. This so-called reliable up- 



dates algorithm [^, 10] would run faster than the defect correction. Although the reliable updates 
in this fashion may not converge for all cases due to the non-orthogonality of p and r, in practice 
it seems to work for most cases. We have implemented both algorithms in our CG solver, and it 
automatically switches to the defect correction if the reliable updates does not converge in the first 
place. 

3. CUDA Kernels and Optimization 

The CUDA architecture developed by NVIDIA enables us to do parallel computations on 
NVIDIA's GPUs. (For more detailed programming models and strategies, see "CUDA Program- 



ming Guide for CUDA Toolkit" [|11|].) 

In CUDA, a thread is the smallest unit to execute a kernel, and a certain number of threads 
form a block. Each thread will be assigned a 3-dimensional index, and each block a 2-dimensional 
index. Inside a block, a warp of threads will execute the kernel concurrently. Each thread has its 
own register memory, and threads in the same block share a shared memory. The space of register 
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and shared memory is very limited, but they have the fastest access speed. The global memory on 
the device can be accessed by any thread, but it has the slowest bandwidth. 

To implement the mixed-precision CG for ODWF with CUDA, we perform all matrix-vector 
multiplication, vector reduction (inner product), and vector addition/subtraction on the GPU (de- 
vice), while the CPU (host) is used to do the flow control, memory assignment, and device control. 

The CUDA kernels in our CG solver can be divided into five different catalogs. We will discuss 
each catalog and their optimization in the following subsections. 

3.1 Vector Index Conversion 

These kernels are used to change the indexing schemes between the device (GPU) and the 
host (CPU). To store the Dirac spinors in an one-dimensional array, we need to map the multi- 
dimensional indices to the ID array index. One needs 4 x 3 x 2 = 24 real numbers to store one 
Dirac spinor at each site of the 5D lattice. On the CPU, this color-spinor index c which runs from 
to 23 is the inner-most (fastest-running) index, which is followed by the fifth-dimension index 
s, and then x,y,z,t indices, where t is the outer-most (slowest-running) index. If //,o.s? denotes the 
one-dimensional array index of the CPU scheme, then we have 

=c + sx24 + xx 24N, H htx lAN^NyN^N, . (3.1) 

However, for computation on the GPU, we assign each thread a five-dimensional site index. This 
implies that adjacent threads have consecutive s indices. Thus we want to arrange the data such 
that optimal coalescence is attained when loading the vector from device global memory to the 
register and/or the shared memory of the threads. Since the GPU provides vector data types such as 
float 4 and double2 which allow us to move 4 single precision numbers (or 2 double precision 
numbers) at one time, a simple way to map to the one-dimensional array index on GPU (for single- 
precision) is 

/dev = c mod 4 + 5x4 + [c/4] x AN, + x x 24A^^ H h f x lAN^NyN^Ns, (3.2) 

and similarly for the double-precision. So every time when we transfer data between the host and 
the device, we convert the index accordingly. 

3.2 Matrix- Vector Multiplication for D^^{Df^) 

D'^^{D^P) is similar to the usual Wilson-Dirac operator D„ without the mass term, 

[£'w^(£»w°)] = -^I [(1 - r,)U,m+af^y + (1 + 7,)t/;(x )4-.A/] • (3-3) 

From this expression we see that the multiplication of D^^(D^*^) with a vector involves the link 
variables U^{x) and /-matrices. We have used the following tricks for optimization. 



Firstly, since the y-matrices in Eq. (3.3) are in the combination (1 ± 7^), the left-handed and 
the right-handed Dirac components are related to each other. Also, since the link variables do not 
have Dirac indices, we can just multiply (x) to the left-handed components, and then restore the 
right-handed components. 
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Secondly, (x) has no fifth-dimension dependence, so threads having the same x but different 
s can share the same U^{x). So we put the link variables in the shared memory. 

Thirdly, because GPU computation is memory bandwidth bound, one must try to reuse the 
data. For example, the hopping term {dx-afi^) in D^, all neighboring sites of x are involved in 
the calculation. If we assign each x to one thread, then there must be overlapping data loading for 
neighboring sites. To reduce this overlapping data transfer, we distribute each (x,j,z) to one thread, 
with a loop in the ^-direction. Then the neighboring data in the f-direction can be reused, and the 
efficiency is enhanced. 

Besides above tuning techniques, we also expand small loops, and to use the texture memory 
for caching data. Here texture is used for loading the vectors and link variables. We use Python 



[12] to expand small loops, and to generate the set of kernels for D„, multiplication. 



3.3 Matrix- Vector Multiplication for Ms 



The matrix M5 is given in Eq. ( |2.8| ). One can see that M5 is block diagonal in the chiral basis 
and it does not depend on the space-time nor the color indices. In fact, it can be divided into two 
constant matrices in the fifth dimension, i.e., the left-handed and the right-handed ones. So the 
multiplication of M5 with a vector can be regarded as = Y.s' {^5) ss'V s' ■ Here we use the shared 
memory to store the source vector (v). Since M5 only takes IN^ real numbers, we can put M5 into 
the register of each thread (with different s and x,y,z)- Again, a loop in t is inserted to reuse the M$ 
data in the register. Also, we use Python to generate these kernels. 



3.4 Vector Reduction 

To calculate the norm of a vector, we use the well-known parallel reduction with the shared 
memory. However, due to the limitation on the number of threads per block, it is inevitable to 
use global memory when the size of a vector becomes very large. Our solution is to perform the 
block reduction in prior kernels, i.e., to sum up vector elements (already stored in registers/shared 
memory) within each block. Then these partial sums can be added with a parallel reduction. 



3.5 Vector Addition and Subtraction 

We can combine the simple addition/subtraction with other kernels in which one vector has 
been loaded. For example, to multiply C = 1 — M^D'^^MsD^^ to a vector, we can combine the last 
subtraction with the last M5 multiplication. 



4. Performance 



We present some benchmarks of our CG solver, using NVIDIA GeForce GTX 285, GeForce 
GTX 480, Tesla™ €1060, and Tesla C2050. Note that our code has not yet been well-tuned for the 
Fermi architecture (GTX 480 and C2050). From Table |TJ we see that the bottleneck of our program 
is in the single-precision D„ matrix- vector multiplication. Due to the mixed-precision CG, the time 
used in the double-precision operations are almost negligible. For the Fermi architecture, due to 
the larger LI cache, there is a significant improvement in the single-precision D^^. matrix- vector 
multiplication, and also in the double-precision M5 matrix-vector multiplication. 
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Table 1: Benchmark of our CG solver for DWF on a 16^ x 32 x 16 lattice, numbers in units of Gflops) 





D„, (single) 


Ms (single) 


D,, (double) 


Ms (double) 


CG (mixed) 


GTX 285 


177 


346 


33 


69 


181 


GTX 480 


248 


331 


32 


116 


233 


CI 060 


128 


290 


29 


61 


132 


C2050 


160 


239 


22 


100 


156 



5. Summary 

We have implemented an efficient GPU-based CG solver for generalized domain-wall fermions 
in lattice QCD. Our CUDA kernels are tuned with several innovative techniques. On NVIDIA 
GeForce GTX 285/480, our CG solver achieves 180/233 Gflops (sustained). This efficient CG 
solver constitutes the most crucial part in TWQCD's HMC code for simulation of unquenched 
lattice QCD with the optimal domain-wall fermion. 
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